Introduction

This code is part of a data analysis that predicts the taxi tips to riders in NYC. It uses historical data coming from the NYC Taxi & Limousine Commission. For this exercise, we use data corresponding to March, June and November 2017. This document presents the data exploration, the issues identified and the solutions applied.

We start by setting the environment and loading the data.

library (dplyr)
library (lubridate)
library (skimr)
library (ggplot2)
library (rgdal)
library (leaflet)

## We load and bind together March, June and November 2017 data. 
yellowtripdata <- dplyr::as_tibble(read.csv("data/raw_data/yellow_tripdata_2017-03.csv"))
yellowtripdata <- rbind (yellowtripdata , read.csv("data/raw_data/yellow_tripdata_2017-06.csv"))
yellowtripdata <- rbind (yellowtripdata , read.csv("data/raw_data/yellow_tripdata_2017-11.csv"))

Data description

Field Name Description
VendorID A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc
tpep_pickup_datetime The date and time when the meter was engaged.
tpep_dropoff_datetime The date and time when the meter was disengaged.
Passenger_count The number of passengers in the vehicle.This is a driver-entered value.
Trip_distance The elapsed trip distance in miles reported by the taximeter.
PULocationID TLC Taxi Zone in which the taximeter was engaged.
DOLocationID TLC Taxi Zone in which the taximeter was disengaged.
RateCodeID The final rate code in effect at the end of the trip. 1= Standard rate; 2=JFK; 3=Newark; 4=Nassau/Westchester; 5=Negotiated; 6=Group ride
Store_and_fwd_flag This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip; N= not a store and forward trip.
Payment_type A numeric code signifying how the passenger paid for the trip. 1= Credit card; 2= Cash; 3= No charge; 4= Dispute; 5= Unknown; 6= Voided trip
Fare_amount The time-and-distance fare calculated by the meter.
Extra Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount Total amount of all tolls paid in trip.
Total_amount The total amount charged to passengers. Does not include cash tips.

Data quality assessment

We validate that data has been properly loaded.

str (yellowtripdata)
## tibble [29,236,424 × 17] (S3: tbl_df/tbl/data.frame)
##  $ VendorID             : int [1:29236424] 2 2 2 2 2 1 1 1 1 2 ...
##  $ tpep_pickup_datetime : chr [1:29236424] "2017-03-09 21:30:11" "2017-03-09 21:47:00" "2017-03-09 22:01:08" "2017-03-09 22:16:05" ...
##  $ tpep_dropoff_datetime: chr [1:29236424] "2017-03-09 21:44:20" "2017-03-09 21:58:01" "2017-03-09 22:11:16" "2017-03-10 06:26:11" ...
##  $ passenger_count      : int [1:29236424] 1 1 1 1 1 1 1 1 1 1 ...
##  $ trip_distance        : num [1:29236424] 4.06 2.73 2.27 3.86 3.45 2.8 6 8.7 3.7 4.21 ...
##  $ RatecodeID           : int [1:29236424] 1 1 1 1 1 1 1 1 1 1 ...
##  $ store_and_fwd_flag   : chr [1:29236424] "N" "N" "N" "N" ...
##  $ PULocationID         : int [1:29236424] 148 48 79 237 41 261 87 142 68 261 ...
##  $ DOLocationID         : int [1:29236424] 48 107 162 41 162 79 142 181 141 163 ...
##  $ payment_type         : int [1:29236424] 1 2 1 1 2 1 1 1 1 1 ...
##  $ fare_amount          : num [1:29236424] 14 11.5 10 12 12 12.5 19.5 30 16.5 20.5 ...
##  $ extra                : num [1:29236424] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 ...
##  $ mta_tax              : num [1:29236424] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ tip_amount           : num [1:29236424] 3.06 0 2.82 3.99 0 1 3.5 7.8 1.5 4.26 ...
##  $ tolls_amount         : num [1:29236424] 0 0 0 0 0 0 0 0 0 0 ...
##  $ improvement_surcharge: num [1:29236424] 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num [1:29236424] 18.4 12.8 14.1 17.3 13.3 ...

We transform the timestamps and categorical variables to a proper format. Then we run the command skim to get a snapshot of the dataframe.

yellowtripdata <- yellowtripdata %>% 
                    mutate (tpep_pickup_datetime  = ymd_hms (tpep_pickup_datetime),
                            tpep_dropoff_datetime = ymd_hms (tpep_dropoff_datetime),
                            VendorID           = factor (VendorID),
                            RatecodeID         = factor (RatecodeID),
                            store_and_fwd_flag = factor (store_and_fwd_flag),
                            PULocationID       = factor (PULocationID),
                            DOLocationID       = factor (DOLocationID),
                            payment_type       = factor (payment_type)
                         )
skim (yellowtripdata)
Data summary
Name yellowtripdata
Number of rows 29236424
Number of columns 17
_______________________
Column type frequency:
factor 6
numeric 9
POSIXct 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
VendorID 0 1 FALSE 2 2: 15960277, 1: 13276147
RatecodeID 0 1 FALSE 7 1: 28407414, 2: 652907, 5: 97859, 3: 61946
store_and_fwd_flag 0 1 FALSE 2 N: 29111958, Y: 124466
PULocationID 0 1 FALSE 262 237: 1159313, 161: 1097686, 236: 1053298, 186: 1010489
DOLocationID 0 1 FALSE 263 236: 1090520, 161: 1082557, 237: 1016790, 170: 934023
payment_type 0 1 FALSE 5 1: 19844531, 2: 9194104, 3: 154080, 4: 43708

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
passenger_count 0 1 1.62 1.26 0.00 1.00 1.00 2.00 192.00 ▇▁▁▁▁
trip_distance 0 1 2.92 4.48 0.00 0.97 1.60 3.01 9496.98 ▇▁▁▁▁
fare_amount 0 1 13.11 147.28 -550.00 6.50 9.50 14.50 630461.82 ▇▁▁▁▁
extra 0 1 0.33 0.46 -53.71 0.00 0.00 0.50 69.80 ▁▁▇▁▁
mta_tax 0 1 0.50 0.07 -0.50 0.50 0.50 0.50 140.00 ▇▁▁▁▁
tip_amount 0 1 1.87 2.65 -112.00 0.00 1.36 2.46 450.00 ▅▇▁▁▁
tolls_amount 0 1 0.33 1.97 -17.50 0.00 0.00 0.00 1018.95 ▇▁▁▁▁
improvement_surcharge 0 1 0.30 0.01 -0.30 0.30 0.30 0.30 1.00 ▁▁▇▁▁
total_amount 0 1 16.45 147.52 -550.30 8.75 11.80 17.80 630463.12 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
tpep_pickup_datetime 0 1 2001-01-01 00:04:13 2041-11-15 02:57:16 2017-06-14 07:51:59 7021748
tpep_dropoff_datetime 0 1 2001-01-01 00:04:51 2041-11-15 03:12:19 2017-06-14 08:05:43 7032908
Issues observed Cause Plan to resolve it Observations
Negative $ amounts payment type payment_type == 1 OR payment_type == 2 AND force any amount to be >= 0
max (passenger_count) == 192 / outliers wrong input 0 < passenger_count < 6 link
trip_distance outliers ? Apply the interquartile range method to remove outliers
fare_amount outliers ? Apply the interquartile range method to remove outliers
tip_amount outliers ? Apply the interquartile range method to remove outliers
outlier dates e.g. year == 2001 meter has wrong date setting year==2017
tolls_amount outliers ? tolls_amount < 125 The interquantile method seems to fail, probably due to a non-continuous distribution.

Data cleaning

yellowtripdata_clean <- yellowtripdata %>% 
               filter (payment_type == 1 | payment_type == 2,
                       trip_distance > 0. ,
                       passenger_count > 0 & passenger_count < 6,
                       fare_amount > 0.,
                       extra >= 0.,
                       mta_tax >= 0.,
                       tip_amount >= 0.,
                       tolls_amount >= 0.,
                       improvement_surcharge >= 0.,
                       total_amount > 0. ,
                       year (tpep_pickup_datetime) == 2017 & year (tpep_dropoff_datetime) == 2017
                 )

## To remove outliers we apply the interquartile range method 
## to the variables: trip_distance, fare_amount and tip_amount.
Interquartile_range_method <- function (df, variable) {
     ## This function applies the interquartile range method to 
     ## remove outliers from a "variable" in the dataframe "df".
     ## It returns a dataframe free of outliers.
     Q1 <- quantile(variable, .25)
     Q3 <- quantile(variable, .75)
     IQR <- IQR(variable)  
     no_outliers <- subset(df, variable> (Q1 - 1.5*IQR) & variable< (Q3 + 1.5*IQR))
     return (no_outliers)
}
yellowtripdata_clean  <- Interquartile_range_method (yellowtripdata_clean , yellowtripdata_clean$trip_distance)
yellowtripdata_clean  <- Interquartile_range_method (yellowtripdata_clean , yellowtripdata_clean$fare_amount)
yellowtripdata_clean  <- Interquartile_range_method (yellowtripdata_clean , yellowtripdata_clean$tip_amount)
yellowtripdata_clean  <- yellowtripdata_clean %>% filter (tolls_amount < 125.)

skim (yellowtripdata_clean)
Data summary
Name yellowtripdata_clean
Number of rows 24044303
Number of columns 17
_______________________
Column type frequency:
factor 6
numeric 9
POSIXct 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
VendorID 0 1 FALSE 2 2: 12845255, 1: 11199048
RatecodeID 0 1 FALSE 7 1: 24039893, 5: 2941, 3: 827, 4: 597
store_and_fwd_flag 0 1 FALSE 2 N: 23954727, Y: 89576
PULocationID 0 1 FALSE 260 237: 1054410, 161: 950254, 236: 946389, 186: 879831
DOLocationID 0 1 FALSE 261 236: 996775, 161: 954776, 237: 934094, 170: 838995
payment_type 0 1 FALSE 2 1: 16217689, 2: 7826614, 3: 0, 4: 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
passenger_count 0 1 1.49 1.02 1.00 1.00 1.00 2.00 5.00 ▇▂▁▁▁
trip_distance 0 1 1.74 1.14 0.01 0.90 1.40 2.27 6.09 ▇▇▂▁▁
fare_amount 0 1 9.42 4.10 0.01 6.00 8.50 12.00 21.30 ▁▇▅▂▁
extra 0 1 0.32 0.37 0.00 0.00 0.00 0.50 64.20 ▇▁▁▁▁
mta_tax 0 1 0.50 0.01 0.00 0.50 0.50 0.50 7.68 ▇▁▁▁▁
tip_amount 0 1 1.33 1.24 0.00 0.00 1.26 2.15 5.37 ▇▅▃▁▁
tolls_amount 0 1 0.01 0.26 0.00 0.00 0.00 0.00 101.52 ▇▁▁▁▁
improvement_surcharge 0 1 0.30 0.00 0.00 0.30 0.30 0.30 0.30 ▁▁▁▁▇
total_amount 0 1 11.89 4.79 0.31 8.16 10.80 14.76 117.82 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
tpep_pickup_datetime 0 1 2017-03-01 2017-12-01 23:28:58 2017-06-13 23:27:13 6742524
tpep_dropoff_datetime 0 1 2017-03-01 2017-12-02 00:00:29 2017-06-13 23:39:01 6752360

Payment_type

It is strange to observe a p25 of 0 for tip_amount when in the US is very rare that people do not tip a driver. As it is described, cash tips are not included, when customer pay in cash drivers prefer to not or can not record the tip. Let us check it out…

yellowtripdata_clean %>%
     ggplot (aes (x= tip_amount)) + geom_histogram(bins = 30) + facet_grid( payment_type~. )

## We need to remove the payments done in cash.
yellowtripdata_clean <- yellowtripdata_clean %>% filter (payment_type == 1)

Timestamps

To finish we add extra features from the timestamps: hour, month, weekday (Sunday =1,…,Saturday = 7) and trip time. From trip time and distance we create a trip_mean_speed [miles/min]. We also remove the outliers from the trip time using the same method as before and visualize the result.

yellowtripdata_clean <- yellowtripdata_clean %>% mutate (
                                pickup_hour   = factor (hour (tpep_pickup_datetime)),
                                dropoff_hour  = factor (hour (tpep_dropoff_datetime)),
                                pickup_wday   = factor (wday (tpep_pickup_datetime)),
                                dropoff_wday  = factor (wday (tpep_dropoff_datetime)),
                                pickup_month  = factor (month (tpep_pickup_datetime)),
                                dropoff_month = factor (month (tpep_dropoff_datetime)),
                                trip_time_min = as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime , units="min")),
                                trip_mean_speed = trip_distance/trip_time_min
                                )
yellowtripdata_clean  <- Interquartile_range_method (yellowtripdata_clean , yellowtripdata_clean$trip_time_min)
yellowtripdata_clean  %>% ggplot (aes (x=trip_time_min)) + geom_histogram(bins = 30)

yellowtripdata_clean %>% ggplot (aes (x = trip_mean_speed)) + geom_histogram(bins =30) + xlab ("trip_mean_speed [miles/min]")
## Warning: Removed 124 rows containing non-finite values (stat_bin).

We ensure that the mean speed of the trip makes sense by required a trip distance larger than 1 mile and longer than 5 min.

yellowtripdata_clean <- yellowtripdata_clean %>% filter (trip_distance > 1 & trip_time_min >5)

yellowtripdata_clean %>% filter (trip_distance > 1 & trip_time_min >5) %>% 
  ggplot (aes (x = trip_mean_speed)) + 
  geom_histogram(bins =30) + 
  xlab ("trip_mean_speed [miles/min]")

NYC map

To visualize the results we are going to create a NYC divided and represent some results directly in the map using the district numbers provided in the data.

We start by loading a shapefile from NYC taxi zones. This data is freely available at NYC Open Data webpage.

##
ny_taxi_zones <- readOGR("data/NYC Taxi Zones/geo_export_a3336026-bd63-4f88-a657-06200b5a5e82.shp")
## Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
## dumpSRS, : Discarded datum WGS84 in CRS definition: +proj=longlat +ellps=WGS84
## +no_defs
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/garoe.gonzalez/Desktop/analysis/data/NYC Taxi Zones/geo_export_a3336026-bd63-4f88-a657-06200b5a5e82.shp", layer: "geo_export_a3336026-bd63-4f88-a657-06200b5a5e82"
## with 263 features
## It has 6 fields
leaflet(ny_taxi_zones) %>%
     addTiles() %>% 
     addProviderTiles("CartoDB.Positron") %>% 
     addPolygons(popup = ~zone, stroke = TRUE, color = "black", weight = 0.2)

Enriching the map with taxi trips information

We add the mean tip given by customers that finish their trip in a certain taxi zone. We could have chosen the pickup location or both, but we decided to just use drop-off for simplifying the exercise, and because tip is given then.

## We extract all the info we can get from the neighborhood polygons.
zones_info <-data.frame (borough   = ny_taxi_zones$borough,
                         location_i= ny_taxi_zones$location_i,
                         zone      = ny_taxi_zones$zone)

## We compute the tip mean per drop location ID.
tip_DO_ID <- yellowtripdata %>% group_by (DOLocationID) %>% 
                                 summarize (tip_mean = mean (tip_amount)) %>% 
                                 mutate   (DOLocationID = as.integer (DOLocationID))
## `summarise()` ungrouping output (override with `.groups` argument)
## We join the data back to the neighborhood names.
zones_info<- zones_info %>% left_join(tip_DO_ID, by = c ("location_i"  = "DOLocationID"))

## We add the tip information to the shapefile polygons.
ny_taxi_zones$tip_mean <- zones_info$tip_mean

Features removal

Since we only end up using one payment_type and we have extracted all relevant information from the timestamps, we remove the variables from our dataframe. We also filter out the VendorID which do not bring relevant information about the tip, customer or trip.

yellowtripdata_clean <- yellowtripdata_clean %>% select (-payment_type,-tpep_dropoff_datetime,
                                                         -tpep_pickup_datetime, -VendorID)

Conclusion

We have reduced our data from 29236424 to 10876305 events (37%) and we are going to use in the next phase of the analysis 21 variables.

Issues observed Action
Negative $ amounts payment_type == 1 AND force any amount to be >= 0
max (passenger_count) == 192 / outliers 0 < passenger_count < 6
trip_distance outliers and very small Interquartile range method & > 1 mile
fare_amount outliers Interquartile range method
tip_amount outliers Interquartile range method
outlier dates e.g. year == 2001 year==2017
tolls_amount outliers tolls_amount < 125
tip_amount == 0 payment_type == 1
trip_time_min outliers and very small Interquartile range method & > 5 min

We save the results in R data objects to be easily loaded by the next phase of the analysis.

saveRDS (yellowtripdata_clean, "data/clean_data/YellowTripCleanData_03_06_11_2017.rds")
saveRDS (ny_taxi_zones, "data/clean_data/ny_taxi_zones_enrich_map.rds")